{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 写在前面 :)\n",
    "\n",
    "**☞项目名称：** 探索ISUX\n",
    "\n",
    "**☞数据加值宣言：**本项目产出按**‘设计’，‘用户体验’，‘社交’等41个分类**挖掘**腾讯ISUX**公众号关于**腾讯社交用户体验设计**的内容选取前55页（共**275篇文章**）数据，以解决UI设计师、用户体验师、产品经理等职位从不同分类查找文章了解腾讯社交用户体验设计的问题。\n",
    "\n",
    "**☞MVP数据加值：** \n",
    "【此产品面对对象：对腾讯社交用户体验设计感兴趣的人】<br>\n",
    "1、挖掘腾讯ISUX公众号前55页的文章数据，解决想要一览最近文章的公众号关注者的需求<br>\n",
    "2、以在公众号出现频率、用户体验有关的关键词分类前55页文章数据，解决UI设计师、用户体验师、产品经理等职位从不同分类查找文章的需求<br><br>\n",
    "【共41个分类，其余归为无法分类】<br>\n",
    "分类有：腾讯， QQ，CF，PUPU，微云，Qzone，小程序，微视，设计，游戏，动画，H5，可视化，3D，CSS，UX，ISUX，社交，用户，需求，用户体验，文字，品牌设计，鹅粉投稿，企鹅，故事，福利，回顾，大赛，论坛，招聘，区块链，原型，大数据，趋势，广告，报告，策略，思维，情感，原创，创意，直播，用户研究，设定\n",
    "\n",
    "**☞挖掘公众号基本信息：** 腾讯ISUX：腾讯社交用户体验设计团队，负责腾讯社交平台、社交应用及社交娱乐等产品的体验、服务、创意设计。力争成为国际化艺术与设计的最热潮流IP，并打造YCG原创馆设计平台，构建设计生态，致力于推动扩大设计行业价值与影响力。\n",
    "\n",
    "**☞两种方法输出表格**\n",
    "\n",
    "方法一输出：腾讯ISUX_requests_url<br>\n",
    "方法二输出：腾讯ISUX_Selenium_url"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "公众号 = \"腾讯ISUX\"\n",
    "fn = { \"output\" : { \"公众号_htm_snippets\": \"weixin/data_raw_src/{公众号}_htm_snippets.tsv\",\n",
    "                    \"公众号_df\": \"weixin/data_raw_src/{公众号}_df.tsv\",\n",
    "                    \"公众号_xlsx\": \"weixin/data_sets/{公众号}_url.xlsx\" } \\\n",
    "      }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 方法一：Requests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "250\n",
      "255\n",
      "260\n",
      "265\n",
      "270\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "import requests\n",
    "import pandas as pd\n",
    "import csv\n",
    "\n",
    "url = \"https://mp.weixin.qq.com/cgi-bin/appmsg\"\n",
    "\n",
    "# 使用Cookie，跳过登陆操作\n",
    "# 注意更换Cookie，否则无法抓取多次\n",
    "headers = {\n",
    "    \"Cookie\":\"pgv_pvi=7978336256; ptui_loginuin=2670022802; RK=hehxiTBX1v; ptcz=ddf8a4921b6844ac47a198955164973097a97574f9fb9d7bc7d8de969fca7adb; UM_distinctid=1719158ce6dba6-0ad37913dbac38-5313f6f-e1000-1719158ce6eba8; pgv_pvid=2415988394; tvfe_boss_uuid=0bfe27a6f161b587; o_cookie=2670022802; pac_uid=1_2670022802; noticeLoginFlag=1; remember_acct=2670022802%40qq.com; pt_local_token=123456789; bizuin=3897075556; ua_id=GMHflRB1EEXU7TdiAAAAALLLg96fhXvauPumDi39CQU=; pgv_si=s5352884224; CNZZDATA1272960370=753410043-1589333207-%7C1589591947; CNZZDATA1272425418=761714792-1589333543-%7C1589591946; uuid=6eac67b28992afa9fc7c8d759f4c08df; ticket=3abe0ac122dcfb98d0853ce948055ae596f46adf; ticket_id=gh_f31c64ffd764; cert=eCjLb_U20lKqX07G6PhfDhVQeJ1qFwkK; rand_info=CAESIF2Uy4pO62Ldb/mZywE7ZQEVNzcuXTKUp1Ld3gFdyDYr; slave_bizuin=3897075556; data_bizuin=3897075556; data_ticket=5sOmT8nzcJngqb4e0gzDIuv61/gIwmlfIlyx4kBi0SR6B3OegUmCcDlBCskiF8VX; slave_sid=a19abWpqUm5jZ0g2T1ZYakRxT09ZOVVGSjUwYWpJQzVYTW1VRm9tQ3J3WXRZWmdSNW1hUW9XYXRHSUQ5OWJoTm9DRlVOOGdObEZaSWtTRU9ZNUdHaVZGSjhWVlFobUh3blJDZE8wZF9GMjZfYUVsckxXT2dpZlc4TDNyNUxNNTlKMzI3QVpkYXFYcjR5eUZ4; slave_user=gh_f31c64ffd764; xid=a64d4f751ea6e0e14bc95c8df85b934d; openid2ticket_oZKdI6GkpTJfQh4a0XeKJLF9Ce9c=I5Qu+E6m4lbNsND1n7YAFbVcxfpkWl9/oKrFZxyQ/NU=; mm_lang=zh_CN\",\n",
    "    \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36\"}\n",
    "\n",
    "data = {\n",
    "    \"token\": \"403913006\",\n",
    "    \"lang\": \"zh_CN\",\n",
    "    \"f\": \"json\",\n",
    "    \"ajax\": \"1\",\n",
    "    \"action\": \"list_ex\",\n",
    "    \"begin\": \"0\",\n",
    "    \"count\": \"5\",\n",
    "    \"query\": \"\",\n",
    "    \"fakeid\": \"MjM5NzQxMDkwMg==\",#公众号的信息\n",
    "    \"type\": \"9\",\n",
    "}\n",
    "\n",
    "content_list=[]\n",
    "\n",
    "# 每页begin会变5\n",
    "# 分开抓，每次改变range  #抓到55页\n",
    "for i in range(50,55):\n",
    "    data[\"begin\"] = i*5 \n",
    "    print(data[\"begin\"])\n",
    "    time.sleep(3)\n",
    "    # 使用get方法进行提交\n",
    "    content_json = requests.get(url, headers=headers, params=data).json()\n",
    "#   print(content_json)\n",
    "    # 返回了一个json，里面是每一页的数据\n",
    "    for item in content_json[\"app_msg_list\"]:\n",
    "    # 提取每页文章的标题及对应的url\n",
    "        items = []\n",
    "        items.append(item[\"title\"])\n",
    "        items.append(item[\"link\"])\n",
    "        items.append(item[\"create_time\"])\n",
    "        content_list.append(items)\n",
    "\n",
    "\n",
    "name=['title','link','create_time']\n",
    "test=pd.DataFrame(columns=name,data=content_list)\n",
    "with pd.ExcelWriter(fn[\"output\"][\"公众号_xlsx\"].format(公众号=\"腾讯ISUX_requests\")) as writer:\n",
    "    test.to_excel(writer)\n",
    "\n",
    "#test.to_csv(\"../weixin/腾讯ISUX.csv\",mode='a',encoding='utf-8')\n",
    "#print(\"保存成功\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 方法二：Selenium"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "selenium 是一个用于Web应用程序测试的工具。Selenium测试直接运行在浏览器中，就像真正的用户在操作一样。\n",
    "* Selenium库的基本使用：https://www.jianshu.com/p/3aa45532e179"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from lxml.html import fromstring\n",
    "import time\n",
    "from random import random"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "H:\\python\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:18: DeprecationWarning: use options instead of chrome_options\n"
     ]
    }
   ],
   "source": [
    "from selenium import webdriver #python自动抓取数据的模块\n",
    "from selenium.webdriver.common.desired_capabilities import DesiredCapabilities\n",
    "\n",
    "#caps=dict()\n",
    "#caps[\"pageLoadStrategy\"] = \"none\"   # Do not wait for full page load\n",
    "\n",
    "opts = webdriver.ChromeOptions()\n",
    "opts.add_argument('--no-sandbox')#解决DevToolsActivePort文件不存在的报错\n",
    "opts.add_argument('window-size=1920x3000') #指定浏览器分辨率\n",
    "opts.add_argument('--disable-gpu') #谷歌文档提到需要加上一这个属性来规避bug\n",
    "opts.add_argument('--hide-scrollbars') #隐藏滚动条, 应对些特殊页面\n",
    "#opts.add_argument('blink-settings=imagesEnabled=false') #不加载图片, 提升速度\n",
    "#opts.add_argument('--headless') #浏览器不提供可视化页面. linux下如果系统不支持可视化不加这条会启动失败\n",
    "\n",
    "#在此之前要加环境变量C:\\Program Files (x86)\\Google\\Chrome\\Application\n",
    "opts.binary_location = r\"C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe\" \n",
    "\n",
    "driver = webdriver.Chrome( chrome_options = opts) #desired_capabilities=caps"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 获取网页\n",
    "driver.get(\"https://mp.weixin.qq.com\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第一步：填表登陆"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "selenium 的定位方法\n",
    "* find_element_by_id &ensp;&ensp;&ensp;  根据标签id定位\n",
    "* find_element_by_name   &ensp;&ensp;&ensp; 根据标签的name定位\n",
    "* find_element_by_xpath  &ensp;&ensp;&ensp; 根据xpath定位\n",
    "* find_element_by_link_text  &ensp;&ensp;&ensp; 通过文字链接来定位元素\n",
    "* find_element_by_partial_link_text  &ensp;&ensp;&ensp;  通过文字链接来定位元素\n",
    "* find_element_by_tag_name  &ensp;&ensp;&ensp;  根据标签的名字定位\n",
    "* find_element_by_class_name  &ensp;&ensp;&ensp; 通过class name 定位\n",
    "* find_element_by_css_selector  &ensp;&ensp;&ensp;  根据元素属性来定位"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "WebDriver 常用方法：\n",
    "* clear()清除文本\n",
    "* send_keys(values)模拟按键输入\n",
    "* click()模拟点击\n",
    "* submit模拟提交"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "payload =  {\"account\": \"2670022802@qq.com\", \"password\": \"2000923hyt.\"}\n",
    "\n",
    "#登录框xpath--模拟输入登陆操作\n",
    "driver.find_element_by_xpath('//div[@class=\"login__type__container login__type__container__scan\"]/a').click()\n",
    "driver.find_element_by_xpath('//form[@class=\"login_form\"]//input[@name=\"account\"]').clear()\n",
    "driver.find_element_by_xpath('//form[@class=\"login_form\"]//input[@name=\"account\"]').send_keys(payload['account'])\n",
    "driver.find_element_by_xpath('//form[@class=\"login_form\"]//input[@name=\"password\"]').clear()\n",
    "driver.find_element_by_xpath('//form[@class=\"login_form\"]//input[@name=\"password\"]').send_keys(payload['password'])\n",
    "driver.find_element_by_xpath('//div[@class=\"login_btn_panel\"]/a').click()\n",
    "\n",
    "# 后扫码验证"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第二步：点击左上方选单"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "其他常用方法\n",
    "* size：返回元素的尺寸\n",
    "* text：获取元素的文本\n",
    "* get_attribute：获取属性值  &ensp;&ensp;&ensp; get_attribute('innerHTML')获取元素内的全部HTML\n",
    "* is_displayed()：设置该元素用户是否可见"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'展开'"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "element = driver.find_element_by_xpath('//a[@id=\"m_open\"]')\n",
    "element.click()\n",
    "main_content = element.get_attribute('innerHTML')\n",
    "main_content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "driver.execute_script(\"window.scrollTo(0,document.body.scrollHeight)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://mp.weixin.qq.com/cgi-bin/appmsg?begin=0&count=10&t=media/appmsg_list&type=10&action=list&token=473634445&lang=zh_CN'"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "element = driver.find_element_by_xpath('//li[@title[contains(.,\"素材管理\")]]/a') \n",
    "# main_content = element.get_attribute('innerHTML')\n",
    "# main_content\n",
    "url_2= element.get_attribute(\"href\")\n",
    "url_2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "driver.get(url_2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第三步：点击新建图文消息"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "element = driver.find_element_by_xpath('//*[text()[contains(.,\"新建图文消息\")]]') \n",
    "main_content = element.get_attribute('innerHTML')\n",
    "main_content\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['CDwindow-9CDA56ABAF6EECAD9F621C35DD8C2338', 'CDwindow-9B63D54177FB5DDDE591180B82883CBA']\n"
     ]
    }
   ],
   "source": [
    "print (driver.window_handles)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "switch_to的常用用法\n",
    "* driver.switch_to.window(window_name) &ensp;&ensp;&ensp;切换到制定的window_name页面\n",
    "* driver.switch_to.alert() &ensp;&ensp;&ensp;切换到alert弹窗\n",
    "* driver.switch_to.active_element() &ensp;&ensp;&ensp;定位到当前聚焦的元素上\n",
    "* driver.switch_to.default_content() &ensp;&ensp;&ensp;切换到最上层页面（主文档？）\n",
    "* driver.switch_to.frame(frame_reference) &ensp;&ensp;&ensp;通过id、name、element(定位的某个元素)、索引来切换到某个frame\n",
    "* driver.switch_to.parent_frame() &ensp;&ensp;&ensp;这是switch_to中独有的方法，可以切换到上一层的frame，对于层层嵌套的frame很有用"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 新建图文消息开了另一分视窗，所以要切换 switch_to \n",
    "driver.switch_to.window(driver.window_handles[-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第四步：超链接"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                超链接              \n",
      "选择其他公众号\n"
     ]
    }
   ],
   "source": [
    "# 点击-超链接\n",
    "# 坑：注意不要把窗口折叠，否则xpath会不一样\n",
    "element = driver.find_element_by_xpath('//*[text()[contains(.,\"超链接\")]]') \n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "element.click()\n",
    "\n",
    "# 点击-选择其他公众号\n",
    "element = driver.find_element_by_xpath('//*[text()[contains(.,\"选择其他公众号\")]]') \n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 输入要搜索的公众号\n",
    "driver.find_element_by_xpath('//form//div[@class=\"inner_link_account_area\"]//input[@class=\"weui-desktop-form__input\"]').clear()\n",
    "driver.find_element_by_xpath('//form//div[@class=\"inner_link_account_area\"]//input[@class=\"weui-desktop-form__input\"]').send_keys(公众号)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<div class=\"weui-desktop-icon weui-desktop-icon__inputSearch weui-desktop-icon__small\"><!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <svg width=\"16\" height=\"16\" viewBox=\"0 0 16 16\" xmlns=\"http://www.w3.org/2000/svg\"><path d=\"M11.33 10.007l4.273 4.273a.502.502 0 0 1 .005.709l-.585.584a.499.499 0 0 1-.709-.004L10.046 11.3a6.278 6.278 0 1 1 1.284-1.294zm.012-3.729a5.063 5.063 0 1 0-10.127 0 5.063 5.063 0 0 0 10.127 0z\"></path></svg> <!----> <!----> <!----> <!----></div>\n"
     ]
    }
   ],
   "source": [
    "# 点放大镜搜\n",
    "element = driver.find_element_by_xpath('//button[@class=\"weui-desktop-icon-btn weui-desktop-search__btn\"]')\n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<li class=\"inner_link_account_item\"><div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/cibketMByvrbMctLUP7tLkMkJFiav7Ldm5HUNXnNxyW3ia3JRNbfwog4sibf5f3ayrw5aPsLJplVBa7twENYF35WOA/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">腾讯ISUX</strong> <i class=\"inner_link_account_wechat\">微信号：tencent_isux</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div></li><li class=\"inner_link_account_item\"><div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/fvRiaWw7ItBI3NBxaVkNFeJxuG4qBkKkTolNuQQ6662ricx7LfF3g5dBhRMQYmLYaPHDxQ0OGoKdZ2icoibh4Isr3Q/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">isux智能门锁</strong> <i class=\"inner_link_account_wechat\">微信号：未设置</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div></li>\n"
     ]
    }
   ],
   "source": [
    "# Search Engine Results Page\n",
    "element = driver.find_element_by_xpath('//ul[@class=\"inner_link_account_list\"]')\n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "公众号SERP = main_content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>nickname</th>\n",
       "      <th>wechat</th>\n",
       "      <th>img</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>腾讯ISUX</td>\n",
       "      <td>微信号：tencent_isux</td>\n",
       "      <td>http://mmbiz.qpic.cn/mmbiz_png/cibketMByvrbMct...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>isux智能门锁</td>\n",
       "      <td>微信号：未设置</td>\n",
       "      <td>http://mmbiz.qpic.cn/mmbiz_png/fvRiaWw7ItBI3NB...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   nickname            wechat  \\\n",
       "0    腾讯ISUX  微信号：tencent_isux   \n",
       "1  isux智能门锁           微信号：未设置   \n",
       "\n",
       "                                                 img  \n",
       "0  http://mmbiz.qpic.cn/mmbiz_png/cibketMByvrbMct...  \n",
       "1  http://mmbiz.qpic.cn/mmbiz_png/fvRiaWw7ItBI3NB...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# 解析--找到目标公众号\n",
    "root = fromstring(公众号SERP) \n",
    "\n",
    "# 每个结果的xpath\n",
    "主 = root.xpath('//li[@class=\"inner_link_account_item\"]')\n",
    "\n",
    "account_list = []\n",
    "for e in 主:\n",
    "    account_nickname = e.xpath('./div/strong[@class=\"inner_link_account_nickname\"]')[0].text\n",
    "    account_wechat = e.xpath('./div/i[@class=\"inner_link_account_wechat\"]')[0].text\n",
    "    account_img = e.xpath('./div/img/@src')[0]\n",
    "    account = {\"nickname\": account_nickname, \"wechat\": account_wechat, \"img\": account_img,}\n",
    "    account_list.append(account)\n",
    "\n",
    "df_account = pd.DataFrame(account_list)\n",
    "display(df_account)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/cibketMByvrbMctLUP7tLkMkJFiav7Ldm5HUNXnNxyW3ia3JRNbfwog4sibf5f3ayrw5aPsLJplVBa7twENYF35WOA/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">腾讯ISUX</strong> <i class=\"inner_link_account_wechat\">微信号：tencent_isux</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div>\n"
     ]
    }
   ],
   "source": [
    "# 点击-目标公众号\n",
    "element = driver.find_element_by_xpath('//ul[@class=\"inner_link_account_list\"]/li')\n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\n跳转_input = driver.find_element_by_xpath(\\'//span[@class=\"weui-desktop-pagination__form\"]/input\\')\\n跳转_a = driver.find_element_by_xpath(\\'//span[@class=\"weui-desktop-pagination__form\"]/a\\')\\n跳转_input.clear()\\n跳转_input.send_keys(2)\\n跳转_a.click()\\n'"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 跳转testing\n",
    "'''\n",
    "跳转_input = driver.find_element_by_xpath('//span[@class=\"weui-desktop-pagination__form\"]/input')\n",
    "跳转_a = driver.find_element_by_xpath('//span[@class=\"weui-desktop-pagination__form\"]/a')\n",
    "跳转_input.clear()\n",
    "跳转_input.send_keys(2)\n",
    "跳转_a.click()\n",
    "'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1, 125]\n",
      "False\n"
     ]
    }
   ],
   "source": [
    "# 页面跳转上限\n",
    "l_e = driver.find_elements_by_xpath('//label[@class=\"weui-desktop-pagination__num\"]')\n",
    "l_e_int  = [int(x.text) for x in l_e] \n",
    "print (l_e_int)\n",
    "print (l_e_int[0]==l_e_int[-1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125]\n"
     ]
    }
   ],
   "source": [
    "# 查页码范围\n",
    "pages = list(range(l_e_int[0],l_e_int[-1]+1 ))\n",
    "print(pages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55]\n"
     ]
    }
   ],
   "source": [
    "# 只抓55页\n",
    "pages = list(range(1,56 ))\n",
    "print(pages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第五步：循环遍历"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 全局变量\n",
    "html_raw = dict()\n",
    "main_content =\"\"\n",
    "element = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t12\t13\t14\t15\t16\t17\t18\t19\t20\t21\t22\t23\t24\t25\t26\t27\t28\t29\t30\t31\t32\t33\t34\t35\t36\t37\t38\t39\t40\t41\t42\t43\t44\t45\t46\t47\t48\t49\t50\t51\t52\t53\t54\t55\t"
     ]
    }
   ],
   "source": [
    "def process_pages (pages):\n",
    "    for p in pages:\n",
    "        print (p,end='\\t')\n",
    "\n",
    "        跳转_input = driver.find_element_by_xpath('//span[@class=\"weui-desktop-pagination__form\"]/input')\n",
    "        跳转_a = driver.find_element_by_xpath('//span[@class=\"weui-desktop-pagination__form\"]/a')\n",
    "        跳转_input.clear()\n",
    "        跳转_input.send_keys(p)\n",
    "        跳转_a.click()\n",
    "\n",
    "        time.sleep(45+120*random())\n",
    "\n",
    "        element = driver.find_element_by_xpath('//div[@class=\"inner_link_article_list\"]')\n",
    "        main_content = element.get_attribute('innerHTML') #获取对象的内容\n",
    "        #print(main_content)\n",
    "        html_raw[p] = main_content\n",
    "\n",
    "process_pages(pages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Stored 'html_raw' (dict)\n"
     ]
    }
   ],
   "source": [
    "df = pd.DataFrame([html_raw]).T\n",
    "df.columns = [\"html_snippets\"]\n",
    "\n",
    "%store html_raw\n",
    "import pickle \n",
    "filehandler = open(\"html_raw\", 'wb') \n",
    "pickle.dump(html_raw, filehandler) #通过dump把处理好的数据序列化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "55\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>html_snippets</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [html_snippets]\n",
       "Index: []"
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_out = df[~df.duplicated()]\n",
    "print (len(df_out))#不重复的行数\n",
    "df[df.duplicated()]#重复"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 重复的，index是？\n",
    "try_again = list(df[df.duplicated()].index)\n",
    "print(try_again)\n",
    "try_again = try_again + list (set(pages).difference(set(df.index.values)))\n",
    "try_again"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第六步：暂存档"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [],
   "source": [
    "filename = fn [\"output\"] [\"公众号_htm_snippets\"] \n",
    "df_out.to_csv(filename.format(公众号=公众号), sep=\"\\t\", encoding=\"utf8\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,"
     ]
    }
   ],
   "source": [
    "def parse_html_snippets(_snippet_):\n",
    "    root = fromstring(_snippet_) \n",
    "    title = [x.text for x in root.xpath('//div[@class=\"inner_link_article_title\"]')]\n",
    "    create_time = [x.text for x in root.xpath('//div[@class=\"inner_link_article_date\"]')]\n",
    "    link = [x for x in root.xpath('//a/@href')]\n",
    "    _df_ = pd.DataFrame({\"标题\":title, \"时间\": create_time, \"文章链接\":link})\n",
    "    return(_df_)\n",
    "    \n",
    "l_df = []\n",
    "for p in pages:\n",
    "    _df_ = parse_html_snippets(df.loc[p,\"html_snippets\"])\n",
    "    print (len(_df_), end=\",\")\n",
    "    l_df.append(_df_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>标题</th>\n",
       "      <th>时间</th>\n",
       "      <th>文章链接</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>轻盈娱乐 | QQ个性化商城改版</td>\n",
       "      <td>2020-05-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>鹅粉投稿 | CF企鹅拆箱视频</td>\n",
       "      <td>2020-05-12</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3D探索 | 卡噗内容趋势设定</td>\n",
       "      <td>2020-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>福利赠送 | 蜜桃扑扑设计故事</td>\n",
       "      <td>2020-04-29</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>QQ &amp; SF 首度联名创作</td>\n",
       "      <td>2020-04-22</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 标题          时间  \\\n",
       "0  轻盈娱乐 | QQ个性化商城改版  2020-05-14   \n",
       "1   鹅粉投稿 | CF企鹅拆箱视频  2020-05-12   \n",
       "2   3D探索 | 卡噗内容趋势设定  2020-05-07   \n",
       "3   福利赠送 | 蜜桃扑扑设计故事  2020-04-29   \n",
       "4    QQ & SF 首度联名创作  2020-04-22   \n",
       "\n",
       "                                                文章链接  \n",
       "0  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "1  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "2  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "3  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "4  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  "
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_url_out = pd.concat(l_df).reset_index(drop=True)\n",
    "df_url_out.head(5) #显示前5条"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>标题</th>\n",
       "      <th>时间</th>\n",
       "      <th>文章链接</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>270</th>\n",
       "      <td>QQ国际版视觉探索-Insight . Create</td>\n",
       "      <td>2017-02-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>271</th>\n",
       "      <td>办公应用的正确打开方式-基于TIM项目的反思与探索</td>\n",
       "      <td>2017-02-09</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>272</th>\n",
       "      <td>Emoji絵文字／えもじ -- 多终端适配！</td>\n",
       "      <td>2017-02-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>273</th>\n",
       "      <td>H5动画开发快车道 - AnimateCC与createjs开发实践</td>\n",
       "      <td>2017-02-03</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>274</th>\n",
       "      <td>搞点新意思－QQiPad主题带你飞</td>\n",
       "      <td>2017-02-02</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     标题          时间  \\\n",
       "270          QQ国际版视觉探索-Insight . Create  2017-02-14   \n",
       "271           办公应用的正确打开方式-基于TIM项目的反思与探索  2017-02-09   \n",
       "272              Emoji絵文字／えもじ -- 多终端适配！  2017-02-04   \n",
       "273  H5动画开发快车道 - AnimateCC与createjs开发实践  2017-02-03   \n",
       "274                   搞点新意思－QQiPad主题带你飞  2017-02-02   \n",
       "\n",
       "                                                  文章链接  \n",
       "270  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "271  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "272  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "273  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "274  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  "
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_url_out.tail(5)  # show the last 5 rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 157,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>标题</th>\n",
       "      <th>时间</th>\n",
       "      <th>文章链接</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>序</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>踏一池春水，等天朗气清</td>\n",
       "      <td>2020-04-03</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>网络上的另一个我 | 00后人设剖析</td>\n",
       "      <td>2020-03-31</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>每个表情都是宅家的我</td>\n",
       "      <td>2020-02-21</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>增强体质 | 多福带你学习八段锦</td>\n",
       "      <td>2020-02-16</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>如何在新春送出最炸的祝福</td>\n",
       "      <td>2020-01-22</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>The Power of Warmth</td>\n",
       "      <td>2019-12-23</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>Begin the Adventure</td>\n",
       "      <td>2019-12-16</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38</th>\n",
       "      <td>高定白太空鹅品牌视频</td>\n",
       "      <td>2019-12-02</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <td>那些欲罢不能的实用工具</td>\n",
       "      <td>2019-11-28</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41</th>\n",
       "      <td>NEVER GIVE UP</td>\n",
       "      <td>2019-11-25</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44</th>\n",
       "      <td>七星宇航探索</td>\n",
       "      <td>2019-11-18</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>66</th>\n",
       "      <td>【预售发布】太空鹅联盟系列盲盒6+1</td>\n",
       "      <td>2019-08-26</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75</th>\n",
       "      <td>【高定限量版】太空鹅手办众筹发布</td>\n",
       "      <td>2019-07-11</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>83</th>\n",
       "      <td>国际波普大师RON ENGLISH携新作震撼来袭！</td>\n",
       "      <td>2019-06-16</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>90</th>\n",
       "      <td>独家  |  陈漫的图像哲学</td>\n",
       "      <td>2019-05-23</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>点滴匠心，声入人心</td>\n",
       "      <td>2019-05-10</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>115</th>\n",
       "      <td>你有一份「除夕特礼」待签收！</td>\n",
       "      <td>2019-02-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>116</th>\n",
       "      <td>为你准备了一份讨红包利器</td>\n",
       "      <td>2019-02-02</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>122</th>\n",
       "      <td>用这份高颜值红包开启你的RICH年</td>\n",
       "      <td>2019-01-18</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>124</th>\n",
       "      <td>这碗腊八豆子你舍得吃吗？</td>\n",
       "      <td>2019-01-13</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>125</th>\n",
       "      <td>八周年，感谢有你</td>\n",
       "      <td>2019-01-11</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>128</th>\n",
       "      <td>冬至 | 今天夜最长，多点陪伴</td>\n",
       "      <td>2018-12-22</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>129</th>\n",
       "      <td>[独家专访] 米高的公仔世界</td>\n",
       "      <td>2018-12-20</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>137</th>\n",
       "      <td>Who is YCG</td>\n",
       "      <td>2018-11-24</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>153</th>\n",
       "      <td>Tencent 20th Box Tee</td>\n",
       "      <td>2018-09-11</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>160</th>\n",
       "      <td>一步一步让你“PICK ME”</td>\n",
       "      <td>2018-08-21</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>178</th>\n",
       "      <td>分享图片</td>\n",
       "      <td>2018-07-05</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>197</th>\n",
       "      <td>超现实主义的人间乐园</td>\n",
       "      <td>2018-04-11</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>200</th>\n",
       "      <td>寻找闹市中的艺术净土</td>\n",
       "      <td>2018-03-16</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>203</th>\n",
       "      <td>旺年开工大吉</td>\n",
       "      <td>2018-03-02</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>207</th>\n",
       "      <td>What You See Is What You Get</td>\n",
       "      <td>2018-01-31</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>213</th>\n",
       "      <td>极速适配 iPhone X 秘笈</td>\n",
       "      <td>2018-01-10</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>214</th>\n",
       "      <td>新年快乐！</td>\n",
       "      <td>2018-01-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>218</th>\n",
       "      <td>表格边框你知多少</td>\n",
       "      <td>2017-10-26</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>222</th>\n",
       "      <td>记住了！9月5号开始，我们一起进化！Go~</td>\n",
       "      <td>2017-08-31</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>228</th>\n",
       "      <td>个性萌宠诞生记－动作篇</td>\n",
       "      <td>2017-08-11</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>230</th>\n",
       "      <td>个性萌宠诞生记－形象篇</td>\n",
       "      <td>2017-08-09</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>233</th>\n",
       "      <td>萌宠来袭--空间宠物品牌影像</td>\n",
       "      <td>2017-08-01</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>240</th>\n",
       "      <td>欢迎来到后 ASO 时代</td>\n",
       "      <td>2017-07-11</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>245</th>\n",
       "      <td>Pet To Joy</td>\n",
       "      <td>2017-05-31</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>246</th>\n",
       "      <td>“从心出发”品牌企划-SNG五周年</td>\n",
       "      <td>2017-05-25</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>247</th>\n",
       "      <td>vuejs初体验-Chrome插件开发实录</td>\n",
       "      <td>2017-05-24</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>249</th>\n",
       "      <td>每个人心中都有一件白T</td>\n",
       "      <td>2017-05-17</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>252</th>\n",
       "      <td>《大唐荣耀2》开虐了，小鲜肉到齐了，都出自他的手笔！</td>\n",
       "      <td>2017-04-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>254</th>\n",
       "      <td>玲珑有致，富于肉感，“穆哈风格”在这里再现</td>\n",
       "      <td>2017-03-30</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>255</th>\n",
       "      <td>惊艳！这是好莱坞的大片吗？</td>\n",
       "      <td>2017-03-28</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>257</th>\n",
       "      <td>深挖data URI性能瓶颈</td>\n",
       "      <td>2017-03-23</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>260</th>\n",
       "      <td>浏览器亚像素渲染与小数位的取舍</td>\n",
       "      <td>2017-03-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>271</th>\n",
       "      <td>办公应用的正确打开方式-基于TIM项目的反思与探索</td>\n",
       "      <td>2017-02-09</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                               标题          时间  \\\n",
       "序                                               \n",
       "7                     踏一池春水，等天朗气清  2020-04-03   \n",
       "8              网络上的另一个我 | 00后人设剖析  2020-03-31   \n",
       "16                     每个表情都是宅家的我  2020-02-21   \n",
       "18               增强体质 | 多福带你学习八段锦  2020-02-16   \n",
       "21                   如何在新春送出最炸的祝福  2020-01-22   \n",
       "30            The Power of Warmth  2019-12-23   \n",
       "32            Begin the Adventure  2019-12-16   \n",
       "38                     高定白太空鹅品牌视频  2019-12-02   \n",
       "39                    那些欲罢不能的实用工具  2019-11-28   \n",
       "41                  NEVER GIVE UP  2019-11-25   \n",
       "44                         七星宇航探索  2019-11-18   \n",
       "66             【预售发布】太空鹅联盟系列盲盒6+1  2019-08-26   \n",
       "75               【高定限量版】太空鹅手办众筹发布  2019-07-11   \n",
       "83      国际波普大师RON ENGLISH携新作震撼来袭！  2019-06-16   \n",
       "90                 独家  |  陈漫的图像哲学  2019-05-23   \n",
       "95                      点滴匠心，声入人心  2019-05-10   \n",
       "115                你有一份「除夕特礼」待签收！  2019-02-04   \n",
       "116                  为你准备了一份讨红包利器  2019-02-02   \n",
       "122             用这份高颜值红包开启你的RICH年  2019-01-18   \n",
       "124                  这碗腊八豆子你舍得吃吗？  2019-01-13   \n",
       "125                      八周年，感谢有你  2019-01-11   \n",
       "128               冬至 | 今天夜最长，多点陪伴  2018-12-22   \n",
       "129                [独家专访] 米高的公仔世界  2018-12-20   \n",
       "137                    Who is YCG  2018-11-24   \n",
       "153          Tencent 20th Box Tee  2018-09-11   \n",
       "160               一步一步让你“PICK ME”  2018-08-21   \n",
       "178                          分享图片  2018-07-05   \n",
       "197                    超现实主义的人间乐园  2018-04-11   \n",
       "200                    寻找闹市中的艺术净土  2018-03-16   \n",
       "203                        旺年开工大吉  2018-03-02   \n",
       "207  What You See Is What You Get  2018-01-31   \n",
       "213              极速适配 iPhone X 秘笈  2018-01-10   \n",
       "214                         新年快乐！  2018-01-04   \n",
       "218                      表格边框你知多少  2017-10-26   \n",
       "222         记住了！9月5号开始，我们一起进化！Go~  2017-08-31   \n",
       "228                   个性萌宠诞生记－动作篇  2017-08-11   \n",
       "230                   个性萌宠诞生记－形象篇  2017-08-09   \n",
       "233                萌宠来袭--空间宠物品牌影像  2017-08-01   \n",
       "240                  欢迎来到后 ASO 时代  2017-07-11   \n",
       "245                    Pet To Joy  2017-05-31   \n",
       "246             “从心出发”品牌企划-SNG五周年  2017-05-25   \n",
       "247         vuejs初体验-Chrome插件开发实录  2017-05-24   \n",
       "249                   每个人心中都有一件白T  2017-05-17   \n",
       "252    《大唐荣耀2》开虐了，小鲜肉到齐了，都出自他的手笔！  2017-04-07   \n",
       "254         玲珑有致，富于肉感，“穆哈风格”在这里再现  2017-03-30   \n",
       "255                 惊艳！这是好莱坞的大片吗？  2017-03-28   \n",
       "257                深挖data URI性能瓶颈  2017-03-23   \n",
       "260               浏览器亚像素渲染与小数位的取舍  2017-03-14   \n",
       "271     办公应用的正确打开方式-基于TIM项目的反思与探索  2017-02-09   \n",
       "\n",
       "                                                  文章链接  \n",
       "序                                                       \n",
       "7    http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "8    http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "16   http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "18   http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "21   http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "30   http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "32   http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "38   http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "39   http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "41   http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "44   http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "66   http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "75   http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "83   http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "90   http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "95   http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "115  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "116  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "122  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "124  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "125  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "128  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "129  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "137  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "153  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "160  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "178  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "197  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "200  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "203  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "207  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "213  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "214  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "218  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "222  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "228  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "230  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "233  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "240  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "245  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "246  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "247  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "249  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "252  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "254  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "255  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "257  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "260  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "271  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  "
      ]
     },
     "execution_count": 157,
     "metadata": {},
     "output_type": "execute_result"
    }
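   ],
   "source": [
    "# Tagging: assign each article a category by keyword match on its title.\n",
    "# The leading empty string matches every title, so it seeds df_cat with all rows;\n",
    "# rows still tagged \"\" after the loop are the ones that matched no keyword.\n",
    "tagging_list = [\"\", \"腾讯\", \"QQ\", \"CF\", \"PUPU\", \"微云\", \"Qzone\", \"小程序\", \"微视\",\n",
    "                \"设计\", \"游戏\", \"动画\", \"H5\", \"可视化\", \"3D\", \"CSS\", \"UX\",\n",
    "                \"ISUX\", \"社交\", \"用户\", \"需求\", \"用户体验\", \"文字\", \"品牌设计\",\n",
    "                \"鹅粉投稿\", \"企鹅\", \"故事\", \"福利\",\n",
    "                \"回顾\", \"大赛\", \"论坛\", \"招聘\",\n",
    "                \"区块链\", \"原型\", \"大数据\", \"趋势\", \"广告\",\n",
    "                \"报告\", \"策略\", \"思维\", \"情感\", \"原创\",\n",
    "                \"创意\", \"直播\", \"用户研究\", \"设定\"]\n",
    "\n",
    "v_v_list = []\n",
    "\n",
    "for tag in tagging_list:\n",
    "    # regex=False treats each tag as a literal substring rather than a pattern\n",
    "    index_list = df_url_out[df_url_out.标题.str.contains(tag, regex=False)].index.tolist()\n",
    "    # One row per matching article: index = row number, variable = tag\n",
    "    v_v_pairs = pd.DataFrame({tag: index_list}).melt().set_index(\"value\")\n",
    "    v_v_pairs.index.name = '序'\n",
    "    v_v_list.append(v_v_pairs)\n",
    "\n",
    "# Later tags overwrite earlier ones, so the last matching tag in tagging_list wins\n",
    "df_cat = v_v_list[0]\n",
    "for d in v_v_list[1:]:\n",
    "    df_cat.update(d)\n",
    "\n",
    "# Articles not yet tagged\n",
    "df_url_out.loc[df_cat.query('variable==\"\"').index]"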
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 158,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>标题</th>\n",
       "      <th>时间</th>\n",
       "      <th>文章链接</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [标题, 时间, 文章链接]\n",
       "Index: []"
      ]
     },
     "execution_count": 158,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check for duplicate rows\n",
    "df_url_out[df_url_out.duplicated()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 159,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>标题</th>\n",
       "      <th>时间</th>\n",
       "      <th>文章链接</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>轻盈娱乐 | QQ个性化商城改版</td>\n",
       "      <td>2020-05-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>鹅粉投稿 | CF企鹅拆箱视频</td>\n",
       "      <td>2020-05-12</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3D探索 | 卡噗内容趋势设定</td>\n",
       "      <td>2020-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>福利赠送 | 蜜桃扑扑设计故事</td>\n",
       "      <td>2020-04-29</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>QQ &amp; SF 首度联名创作</td>\n",
       "      <td>2020-04-22</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>270</th>\n",
       "      <td>QQ国际版视觉探索-Insight . Create</td>\n",
       "      <td>2017-02-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>271</th>\n",
       "      <td>办公应用的正确打开方式-基于TIM项目的反思与探索</td>\n",
       "      <td>2017-02-09</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>272</th>\n",
       "      <td>Emoji絵文字／えもじ -- 多终端适配！</td>\n",
       "      <td>2017-02-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>273</th>\n",
       "      <td>H5动画开发快车道 - AnimateCC与createjs开发实践</td>\n",
       "      <td>2017-02-03</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>274</th>\n",
       "      <td>搞点新意思－QQiPad主题带你飞</td>\n",
       "      <td>2017-02-02</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>275 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     标题          时间  \\\n",
       "0                      轻盈娱乐 | QQ个性化商城改版  2020-05-14   \n",
       "1                       鹅粉投稿 | CF企鹅拆箱视频  2020-05-12   \n",
       "2                       3D探索 | 卡噗内容趋势设定  2020-05-07   \n",
       "3                       福利赠送 | 蜜桃扑扑设计故事  2020-04-29   \n",
       "4                        QQ & SF 首度联名创作  2020-04-22   \n",
       "..                                  ...         ...   \n",
       "270          QQ国际版视觉探索-Insight . Create  2017-02-14   \n",
       "271           办公应用的正确打开方式-基于TIM项目的反思与探索  2017-02-09   \n",
       "272              Emoji絵文字／えもじ -- 多终端适配！  2017-02-04   \n",
       "273  H5动画开发快车道 - AnimateCC与createjs开发实践  2017-02-03   \n",
       "274                   搞点新意思－QQiPad主题带你飞  2017-02-02   \n",
       "\n",
       "                                                  文章链接  \n",
       "0    http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "1    http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "2    http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "3    http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "4    http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "..                                                 ...  \n",
       "270  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "271  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "272  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "273  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "274  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  \n",
       "\n",
       "[275 rows x 3 columns]"
      ]
     },
     "execution_count": 159,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Keep only the rows that are not duplicates\n",
    "df_url_out[~df_url_out.duplicated()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 160,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>标题</th>\n",
       "      <th>时间</th>\n",
       "      <th>文章链接</th>\n",
       "      <th>类型</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>轻盈娱乐 | QQ个性化商城改版</td>\n",
       "      <td>2020-05-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "      <td>QQ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>鹅粉投稿 | CF企鹅拆箱视频</td>\n",
       "      <td>2020-05-12</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "      <td>企鹅</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3D探索 | 卡噗内容趋势设定</td>\n",
       "      <td>2020-05-07</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "      <td>设定</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>福利赠送 | 蜜桃扑扑设计故事</td>\n",
       "      <td>2020-04-29</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "      <td>福利</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>QQ &amp; SF 首度联名创作</td>\n",
       "      <td>2020-04-22</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "      <td>QQ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>270</th>\n",
       "      <td>QQ国际版视觉探索-Insight . Create</td>\n",
       "      <td>2017-02-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "      <td>QQ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>271</th>\n",
       "      <td>办公应用的正确打开方式-基于TIM项目的反思与探索</td>\n",
       "      <td>2017-02-09</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "      <td>无法分类</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>272</th>\n",
       "      <td>Emoji絵文字／えもじ -- 多终端适配！</td>\n",
       "      <td>2017-02-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "      <td>文字</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>273</th>\n",
       "      <td>H5动画开发快车道 - AnimateCC与createjs开发实践</td>\n",
       "      <td>2017-02-03</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "      <td>H5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>274</th>\n",
       "      <td>搞点新意思－QQiPad主题带你飞</td>\n",
       "      <td>2017-02-02</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...</td>\n",
       "      <td>QQ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>275 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     标题          时间  \\\n",
       "0                      轻盈娱乐 | QQ个性化商城改版  2020-05-14   \n",
       "1                       鹅粉投稿 | CF企鹅拆箱视频  2020-05-12   \n",
       "2                       3D探索 | 卡噗内容趋势设定  2020-05-07   \n",
       "3                       福利赠送 | 蜜桃扑扑设计故事  2020-04-29   \n",
       "4                        QQ & SF 首度联名创作  2020-04-22   \n",
       "..                                  ...         ...   \n",
       "270          QQ国际版视觉探索-Insight . Create  2017-02-14   \n",
       "271           办公应用的正确打开方式-基于TIM项目的反思与探索  2017-02-09   \n",
       "272              Emoji絵文字／えもじ -- 多终端适配！  2017-02-04   \n",
       "273  H5动画开发快车道 - AnimateCC与createjs开发实践  2017-02-03   \n",
       "274                   搞点新意思－QQiPad主题带你飞  2017-02-02   \n",
       "\n",
       "                                                  文章链接    类型  \n",
       "0    http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...    QQ  \n",
       "1    http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...    企鹅  \n",
       "2    http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...    设定  \n",
       "3    http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...    福利  \n",
       "4    http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...    QQ  \n",
       "..                                                 ...   ...  \n",
       "270  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...    QQ  \n",
       "271  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...  无法分类  \n",
       "272  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...    文字  \n",
       "273  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...    H5  \n",
       "274  http://mp.weixin.qq.com/s?__biz=MjM5NzQxMDkwMg...    QQ  \n",
       "\n",
       "[275 rows x 4 columns]"
      ]
     },
     "execution_count": 160,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Article details plus category label; rows with no matching category are marked 无法分类 (unclassifiable)\n",
    "df_o = df_url_out.join(df_cat).replace(\"\", np.nan).fillna(\"无法分类\")\n",
    "df = df_o.rename(columns={\"variable\":\"类型\"})\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 166,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>数量</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>类型</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>设计</th>\n",
       "      <td>54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>无法分类</th>\n",
       "      <td>49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ISUX</th>\n",
       "      <td>26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>设定</th>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>原创</th>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>QQ</th>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>企鹅</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>故事</th>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>趋势</th>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>福利</th>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>招聘</th>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>大数据</th>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>社交</th>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3D</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>微云</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>策略</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>思维</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>大赛</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PUPU</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>UX</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>用户</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>游戏</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>H5</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>腾讯</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>创意</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>动画</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>鹅粉投稿</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>品牌设计</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>情感</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>原型</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>文字</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>广告</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>用户体验</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>用户研究</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>直播</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>回顾</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>报告</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Qzone</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CSS</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>微视</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>需求</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>小程序</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       数量\n",
       "类型       \n",
       "设计     54\n",
       "无法分类   49\n",
       "ISUX   26\n",
       "设定     23\n",
       "原创     15\n",
       "QQ     13\n",
       "企鹅     11\n",
       "故事      7\n",
       "趋势      6\n",
       "福利      5\n",
       "招聘      5\n",
       "大数据     5\n",
       "社交      4\n",
       "3D      3\n",
       "微云      3\n",
       "策略      3\n",
       "思维      3\n",
       "大赛      3\n",
       "PUPU    3\n",
       "UX      3\n",
       "用户      2\n",
       "游戏      2\n",
       "H5      2\n",
       "腾讯      2\n",
       "创意      2\n",
       "动画      2\n",
       "鹅粉投稿    2\n",
       "品牌设计    2\n",
       "情感      2\n",
       "原型      1\n",
       "文字      1\n",
       "广告      1\n",
       "用户体验    1\n",
       "用户研究    1\n",
       "直播      1\n",
       "回顾      1\n",
       "报告      1\n",
       "Qzone   1\n",
       "CSS     1\n",
       "微视      1\n",
       "需求      1\n",
       "小程序     1"
      ]
     },
     "execution_count": 166,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Count articles per category, most frequent first\n",
    "df_stats_1 = df_o.groupby(by=\"variable\").agg({\"标题\":\"count\"}).sort_values(by=\"标题\", ascending=False)\n",
    "df_stats_1.index.name = '类型'\n",
    "df_stats = df_stats_1.rename(columns={\"标题\":\"数量\"})\n",
    "df_stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7: Output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 167,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Name each frame's column index; the name doubles as its Excel sheet name\n",
    "df_account.columns.name = \"SERP_accounts\"\n",
    "df.columns.name = \"分类_url\"\n",
    "df_stats.columns.name = \"stats_各分类\"\n",
    "\n",
    "# Write all three frames into one workbook, one sheet per frame\n",
    "with pd.ExcelWriter(fn[\"output\"][\"公众号_xlsx\"].format(公众号=\"腾讯ISUX_Selenium\")) as writer:\n",
    "    for _df_ in [df_account, df, df_stats]:\n",
    "        _df_.to_excel(writer, sheet_name=_df_.columns.name)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "247.333px",
    "left": "1045.33px",
    "top": "110px",
    "width": "234.667px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
